Hands-on Exercise 3.1: Processing and Visualising Flow Data
16.1 Overview
Spatial interaction represent the flow of people, material, or information between locations in geographical space. It encompasses everything from freight shipments, energy flows, and the global trade in rare antiquities, to flight schedules, rush hour woes, and pedestrian foot traffic. Understanding what factors and logic went into the decision-making behind those human-induced movements and interdependencies is important because it enables policy makwers to better understand, predict, manage, and help plan for such circulation. For example, policy makers can make informed decisions about how to better allocate resources to improve traffic in a city or to speed up shipments of perishable foodstuffs. It has to do with having a good understanding of the overall situation.
Spatial Interaction Models (SIMs) are mathematical models for estimating movement between spatial entities developed by Alan Wilson in the late 1960s and early 1970, with considerable uptake and refinement for transport modelling since then Boyce and Williams (2015). There are four main types of traditional SIMs (Wilson 1971): - Unconstrained - Production-constrained - Attraction-constrained - Doubly-constrained
Both ordinary least square (OLS) and negative binomial (NB) regression methods have been used extensively to calibrate OD flow models by processing flow data as different types of dependent variables.
By the end to this hands-on exercise, we will be able to:
to import and extract OD data for a selected time interval,
to import and save geospatial data (i.e. bus stops and mpsz) into sf tibble data frame objects,
to populate planning subzone code into bus stops sf tibble data frame,
to construct desire lines geospatial data from the OD data, and
to visualise passenger volume by origin and destination bus stops by using the desire lines data.
16.2 Getting Started
For the purpose of this exercise, four R packages will be used. They are:
sf for importing, integrating, processing and transforming geospatial data.
tidyverse for importing, integrating, wrangling and visualising data.
tmap for creating thematic maps
stplanr for solving common problems in transport planning and modelling, such as how to best get from point A to point B
ggpubr for some easy-to-use functions for creating and customizing ‘ggplot2’- based publication ready plots.
performance for for computing measures to assess model quality, which are not directly provided by R’s ‘base’ or ‘stats’ packages. The primary goal of the performance package is to provide utilities for computing indices of model quality and goodness of fit. These include measures like r-squared (R2), root mean squared error (RMSE)
16.3 Preparing the Flow Data
16.3.1 Importing the OD data
Import the Passenger Volume by Origin Destination Bus Stops data set downloaded from LTA DataMall by using read_csv() of readr package.
Display the odbus tibble data table by using the code chunk below.
Rows: 5,709,512
Columns: 7
$ YEAR_MONTH <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE <chr> "04168", "04168", "80119", "80119", "44069", "4406…
$ DESTINATION_PT_CODE <chr> "10051", "10051", "90079", "90079", "17229", "1722…
$ TOTAL_TRIPS <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …
A quick check of odbus tibble data frame shows that the values in ORIGIN_PT_CODE and DESTINATON_PT_CODE are in numeric data type. Hence, the code chunk below is used to convert these data values into character data type.
16.3.2 Extracting the study data
The data in odbus is generalised into weekend and weekday data. For the purpose of this exercise, we will extract commuting flows on weekday and between 6 and 9 o’clock. After the group-by and sum, the total rows reduced from 5,709,512 ro 241,503.
Print the content of odbus6_9
If we would like to, we could save the output in rds format for future use. We need to ensure that there exists a folder called ‘rds’ in ‘data’ folder before running the code chunk.
To read rds files:
16.4 Working with Geospatial Data
In this exercise, two geospatial data will be used. They are:
BusStop: This data provides the location of bus stop as at the third quarter of 2023. This data is refreshed quarterly by LTA. The last update was in July 2023.
MPSZ-2019: This data provides the sub-zone boundary of URA Master Plan 2019.
Both data sets are in ESRI shapefile format.
16.4.1 Importing geospatial data
Load the BusStop geospatial data using the st_read() function of sf package. Using the st_crs(busstop) will show that the coordinate system used is WSG84 (decimal deg). Using st_tranform(), we will convert the geographical coordinates system to SIngapore’s projected coordinate system crs=3414.
Note that there are repeated bus stop ids , however they have different bus stop roof ids and geometry values.
busstop <- st_read(dsn = "data/geospatial/BusStopLocation/BusStopLocation_Jul2023",
layer = "BusStop") %>%
st_transform(crs = 3414)Reading layer `BusStop' from data source
`C:\yixin-neo\ISSS624_AGA\Hands-on_Ex3\data\geospatial\BusStopLocation\BusStopLocation_Jul2023'
using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
Next load the mpsz data. There are 332 planning subzones in Singapore.
mpsz <- st_read(dsn = "data/geospatial/MPSZ-2019",
layer = "MPSZ-2019") %>%
st_transform(crs = 3414)Reading layer `MPSZ-2019' from data source
`C:\yixin-neo\ISSS624_AGA\Hands-on_Ex3\data\geospatial\MPSZ-2019'
using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS: WGS 84
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21 / Singapore TM
First 10 features:
SUBZONE_N SUBZONE_C PLN_AREA_N PLN_AREA_C REGION_N
1 MARINA EAST MESZ01 MARINA EAST ME CENTRAL REGION
2 INSTITUTION HILL RVSZ05 RIVER VALLEY RV CENTRAL REGION
3 ROBERTSON QUAY SRSZ01 SINGAPORE RIVER SR CENTRAL REGION
4 JURONG ISLAND AND BUKOM WISZ01 WESTERN ISLANDS WI WEST REGION
5 FORT CANNING MUSZ02 MUSEUM MU CENTRAL REGION
6 MARINA EAST (MP) MPSZ05 MARINE PARADE MP CENTRAL REGION
7 SUDONG WISZ03 WESTERN ISLANDS WI WEST REGION
8 SEMAKAU WISZ02 WESTERN ISLANDS WI WEST REGION
9 SOUTHERN GROUP SISZ02 SOUTHERN ISLANDS SI CENTRAL REGION
10 SENTOSA SISZ01 SOUTHERN ISLANDS SI CENTRAL REGION
REGION_C geometry
1 CR MULTIPOLYGON (((33222.98 29...
2 CR MULTIPOLYGON (((28481.45 30...
3 CR MULTIPOLYGON (((28087.34 30...
4 WR MULTIPOLYGON (((14557.7 304...
5 CR MULTIPOLYGON (((29542.53 31...
6 CR MULTIPOLYGON (((35279.55 30...
7 WR MULTIPOLYGON (((15772.59 21...
8 WR MULTIPOLYGON (((19843.41 21...
9 CR MULTIPOLYGON (((30870.53 22...
10 CR MULTIPOLYGON (((26879.04 26...
st_read()function of sf package is used to import the shapefile into R as sf data frame.st_transform()function of sf package is used to transform the projection to crs 3414.
16.5 Geospatial data wrangling
16.5.1 Combining Busstop and mpsz
Code chunk below populates the planning subzone code (i.e. SUBZONE_C) of mpsz sf data frame into busstop sf data frame. The output of st_intersection() is a point sf object. We do not need and therefore will drop the geometry. The number of observations reduced from 5,161 to 5,156 after operation, suggesting that 5 bus stops have been dropped as their point geometry is not within the polygon boundary of sf df mpsz.
st_intersection()is used to perform point and polygon overly and the output will be in point sf object.select()of dplyr package is then use to retain only BUS_STOP_N and SUBZONE_C in the busstop_mpsz sf data frame.five bus stops are excluded in the resultant data frame because they are outside of Singapore boundary.
Save the output into rds format
Next, we are going to append the planning subzone code from busstop_mpsz data frame onto odbus6_9 data frame. By doing so, we get the fields ‘ORIGIN_BS’, ‘DESTIN_BS” and ’ORIGIN_SZ’ all in a df .
# A tibble: 26 × 3
BUS_STOP_N SUBZONE_C LOC_DESC
<chr> <chr> <chr>
1 11009 QTSZ01 Ghim Moh Ter
2 11009 QTSZ01 GHIM MOH TER
3 82221 GLSZ05 BLK 3A
4 82221 GLSZ05 Blk 3A
5 22501 JWSZ09 Blk 662A
6 22501 JWSZ09 BLK 662A
7 96319 TMSZ05 Yusen Logistics
8 96319 TMSZ05 YUSEN LOGISTICS
9 43709 BKSZ07 BLK 644
10 43709 BKSZ07 BLK 644
# ℹ 16 more rows
The join columns will be ‘ORIGIN_PT_CODE’ from odbus6_9 df and ‘BUS_STOP_N’ from busstop_mpsz df. The columns will also be renamed.
Before left_join, odbus6_9 has 241,503 rows, after left join od_data has 242,235 rows.
Check for duplicate for proceeding
# A tibble: 794 × 5
ORIGIN_BS DESTIN_BS TRIPS ORIGIN_SZ LOC_DESC
<chr> <fct> <dbl> <chr> <chr>
1 43709 43009 15 BKSZ07 BLK 644
2 43709 43009 15 BKSZ07 BLK 644
3 43709 43419 42 BKSZ07 BLK 644
4 43709 43419 42 BKSZ07 BLK 644
5 43709 43469 1 BKSZ07 BLK 644
6 43709 43469 1 BKSZ07 BLK 644
7 43709 43479 62 BKSZ07 BLK 644
8 43709 43479 62 BKSZ07 BLK 644
9 43709 43489 23 BKSZ07 BLK 644
10 43709 43489 23 BKSZ07 BLK 644
# ℹ 784 more rows
Remove the duplicated records. The od_data df reduced from 242,235 rows to 241,838 rows after moving duplicates.
Double check again
# A tibble: 0 × 5
# ℹ 5 variables: ORIGIN_BS <chr>, DESTIN_BS <fct>, TRIPS <dbl>,
# ORIGIN_SZ <chr>, LOC_DESC <chr>
Print the current od_data df to see what we are still lacking. We are will missing the destination subzone codes.
| ORIGIN_BS | DESTIN_BS | TRIPS | ORIGIN_SZ | LOC_DESC |
|---|---|---|---|---|
| 01012 | 01112 | 276 | RCSZ10 | HOTEL GRAND PACIFIC |
| 01012 | 01113 | 143 | RCSZ10 | HOTEL GRAND PACIFIC |
| 01012 | 01121 | 66 | RCSZ10 | HOTEL GRAND PACIFIC |
Again, get the destination subzone code for each destination bus stops by performing a left_join again with busstop_mpsz (contains subzone_c codes for each bus stop id).
After left_join, the number of rows increased from 241,838 rows to 242,831 rows.
Check for duplicates
# A tibble: 880 × 7
ORIGIN_BS DESTIN_BS TRIPS ORIGIN_SZ LOC_DESC.x SUBZONE_C LOC_DESC.y
<chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 01013 51071 1 RCSZ10 ST JOSEPH'S CH CCSZ01 MACRITCHI…
2 01013 51071 1 RCSZ10 ST JOSEPH'S CH CCSZ01 MACRITCHI…
3 01112 51071 65 RCSZ10 OPP BUGIS STN EXIT C CCSZ01 MACRITCHI…
4 01112 51071 65 RCSZ10 OPP BUGIS STN EXIT C CCSZ01 MACRITCHI…
5 01112 53041 5 RCSZ10 OPP BUGIS STN EXIT C BSSZ01 Upp Thoms…
6 01112 53041 5 RCSZ10 OPP BUGIS STN EXIT C BSSZ01 Upp Thoms…
7 01121 51071 3 RCSZ04 STAMFORD PR SCH CCSZ01 MACRITCHI…
8 01121 51071 3 RCSZ04 STAMFORD PR SCH CCSZ01 MACRITCHI…
9 01239 51071 22 KLSZ09 SULTAN PLAZA CCSZ01 MACRITCHI…
10 01239 51071 22 KLSZ09 SULTAN PLAZA CCSZ01 MACRITCHI…
# ℹ 870 more rows
Remove duplicates
Sneak peak of the current od_data
| ORIGIN_BS | DESTIN_BS | TRIPS | ORIGIN_SZ | LOC_DESC.x | SUBZONE_C | LOC_DESC.y |
|---|---|---|---|---|---|---|
| 01012 | 01112 | 276 | RCSZ10 | HOTEL GRAND PACIFIC | RCSZ10 | OPP BUGIS STN EXIT C |
| 01012 | 01113 | 143 | RCSZ10 | HOTEL GRAND PACIFIC | DTSZ01 | BUGIS STN EXIT B |
| 01012 | 01121 | 66 | RCSZ10 | HOTEL GRAND PACIFIC | RCSZ04 | STAMFORD PR SCH |
The code chunk below will do the following:
Renames the destination ‘SUBZONE_C’ to ‘DESTIN_SZ’.
There are missing subzone codes for some of the origin and destination bus stop because the bus stops location in July 2023 could be more outdated than August bus stop 2023. We will drop columns with missing values.
Group-by origin subzone and destination subzone to generate a new field ‘MORNING_PEAK’ that contains the summation of all trips from subzone A to subzone B.
Take a look at our final od_data df
| ORIGIN_SZ | DESTIN_SZ | MORNING_PEAK |
|---|---|---|
| TMSZ02 | TMSZ02 | 350755 |
| WDSZ03 | WDSZ03 | 239791 |
| JWSZ08 | JWSZ09 | 234343 |
| BDSZ04 | BDSZ04 | 217917 |
| JWSZ09 | JWSZ09 | 153055 |
| YSSZ03 | YSSZ01 | 152800 |
| JWSZ04 | JWSZ04 | 149110 |
| TMSZ03 | TMSZ02 | 134196 |
| PRSZ05 | PRSZ03 | 103148 |
| JWSZ03 | JWSZ04 | 99666 |
Save the output into an rds file format.
16.6 Visualising Spatial Interaction
In this section, wewill learn how to prepare a desire line by using stplanr package.
16.6.1 Removing intra-zonal flows
We will not plot the intra-zonal flows. The code chunk below will be used to remove intra-zonal flows. It does so by removing the flows that originate and ends in the same subzone.
Rows reduced from 20,987 to 20,697.
16.6.2 Creating desire lines
In this code chunk below, od2line() of stplanr package is used to create the desire lines.
od_data1 is aspatial while mpsz is geospatial data.
Arguments
- flow
-
A data frame representing origin-destination data. The first two columns of this data frame should correspond to the first column of the data in the zones. Thus in
cents_sf(), the first column is geo_code. This corresponds to the first two columns offlow(). - zones
-
A spatial object representing origins (and destinations if no separate destinations object is provided) of travel.
- destinations
-
A spatial object representing destinations of travel flows.
- zone_code
-
Name of the variable in
zonescontaining the ids of the zone. By default this is the first column names in the zones.
The output flowLine is a sf LINESTRING object.
Simple feature collection with 20697 features and 3 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 5105.594 ymin: 25813.33 xmax: 49483.22 ymax: 49552.79
Projected CRS: SVY21 / Singapore TM
First 10 features:
ORIGIN_SZ DESTIN_SZ MORNING_PEAK geometry
1 AMSZ01 AMSZ02 11742 LINESTRING (29501.77 39419....
2 AMSZ01 AMSZ03 14886 LINESTRING (29501.77 39419....
3 AMSZ01 AMSZ04 3237 LINESTRING (29501.77 39419....
4 AMSZ01 AMSZ05 9349 LINESTRING (29501.77 39419....
5 AMSZ01 AMSZ06 2231 LINESTRING (29501.77 39419....
6 AMSZ01 AMSZ07 1714 LINESTRING (29501.77 39419....
7 AMSZ01 AMSZ08 2624 LINESTRING (29501.77 39419....
8 AMSZ01 AMSZ09 2371 LINESTRING (29501.77 39419....
9 AMSZ01 AMSZ10 183 LINESTRING (29501.77 39419....
10 AMSZ01 AMSZ11 930 LINESTRING (29501.77 39419....
16.6.3 Visualising the desire lines
To visualise the resulting desire lines, the code chunk below is used.
Arguments of tm_lines():
lwd: line width. Either a numeric value or a data variable. In the latter case, the class of the highest values (see style) will get the line width defined by scale. If multiple values are specified, small multiples are drawn (see details).
style: method to process the color scale when col is a numeric variable. Discrete gradient options are "cat", "fixed", "sd", "equal", "pretty", "quantile", "kmeans", "hclust", "bclust", "fisher", "jenks", "dpih", "headtails", and "log10_pretty". A numeric variable is processed as a categorical variable when using "cat", i.e. each unique value will correspond to a distinct category
scale: line width multiplier number.
n: preferred number of color scale classes. Only applicable when lwd is the name of a numeric variable.
Rendering process takes about 1 min because of the transparency argument alpha.
When the flow data are very messy and highly skewed like the one shown above, it is wiser to focus on selected flows, for example flow greater than or equal to 5000 as shown below.
tmap_mode('view')
tmap_options(check.and.fix = TRUE)
tm_shape(mpsz) +
tm_polygons() +
flowLine %>%
filter(MORNING_PEAK >= 5000) %>%
tm_shape() +
tm_lines(lwd = "MORNING_PEAK",
style = "quantile",
scale = c(0.1, 1, 3, 5, 7, 10),
n = 6,
alpha = 0.3)16.8 Viewing the Subzone spatial file
Simple feature collection with 10 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 8012.578 ymin: 15748.72 xmax: 35287.9 ymax: 31092.38
Projected CRS: SVY21 / Singapore TM
SUBZONE_N SUBZONE_C PLN_AREA_N PLN_AREA_C REGION_N
1 MARINA EAST MESZ01 MARINA EAST ME CENTRAL REGION
2 INSTITUTION HILL RVSZ05 RIVER VALLEY RV CENTRAL REGION
3 ROBERTSON QUAY SRSZ01 SINGAPORE RIVER SR CENTRAL REGION
4 JURONG ISLAND AND BUKOM WISZ01 WESTERN ISLANDS WI WEST REGION
5 FORT CANNING MUSZ02 MUSEUM MU CENTRAL REGION
6 MARINA EAST (MP) MPSZ05 MARINE PARADE MP CENTRAL REGION
7 SUDONG WISZ03 WESTERN ISLANDS WI WEST REGION
8 SEMAKAU WISZ02 WESTERN ISLANDS WI WEST REGION
9 SOUTHERN GROUP SISZ02 SOUTHERN ISLANDS SI CENTRAL REGION
10 SENTOSA SISZ01 SOUTHERN ISLANDS SI CENTRAL REGION
REGION_C geometry
1 CR MULTIPOLYGON (((33222.98 29...
2 CR MULTIPOLYGON (((28481.45 30...
3 CR MULTIPOLYGON (((28087.34 30...
4 WR MULTIPOLYGON (((14557.7 304...
5 CR MULTIPOLYGON (((29542.53 31...
6 CR MULTIPOLYGON (((35279.55 30...
7 WR MULTIPOLYGON (((15772.59 21...
8 WR MULTIPOLYGON (((19843.41 21...
9 CR MULTIPOLYGON (((30870.53 22...
10 CR MULTIPOLYGON (((26879.04 26...
16.9 Isolating SUBZONE_C (subzone_code) into a new df
Sort mpsz based on values of SUBZONE_C column in ascending order.
Simple feature collection with 10 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 26154.57 ymin: 37511.2 xmax: 31072.47 ymax: 41804.65
Projected CRS: SVY21 / Singapore TM
SUBZONE_N SUBZONE_C PLN_AREA_N PLN_AREA_C REGION_N
171 ANG MO KIO TOWN CENTRE AMSZ01 ANG MO KIO AM NORTH-EAST REGION
170 CHENG SAN AMSZ02 ANG MO KIO AM NORTH-EAST REGION
163 CHONG BOON AMSZ03 ANG MO KIO AM NORTH-EAST REGION
330 TOWNSVILLE AMSZ04 ANG MO KIO AM NORTH-EAST REGION
329 SHANGRI-LA AMSZ05 ANG MO KIO AM NORTH-EAST REGION
172 KEBUN BAHRU AMSZ06 ANG MO KIO AM NORTH-EAST REGION
233 SEMBAWANG HILLS AMSZ07 ANG MO KIO AM NORTH-EAST REGION
254 TAGORE AMSZ08 ANG MO KIO AM NORTH-EAST REGION
242 YIO CHU KANG WEST AMSZ09 ANG MO KIO AM NORTH-EAST REGION
252 YIO CHU KANG AMSZ10 ANG MO KIO AM NORTH-EAST REGION
REGION_C geometry
171 NER MULTIPOLYGON (((29692.8 389...
170 NER MULTIPOLYGON (((30384.33 39...
163 NER MULTIPOLYGON (((30676.17 39...
330 NER MULTIPOLYGON (((29649.88 38...
329 NER MULTIPOLYGON (((28228.2 392...
172 NER MULTIPOLYGON (((28491.21 39...
233 NER MULTIPOLYGON (((27744.03 38...
254 NER MULTIPOLYGON (((27193.46 41...
242 NER MULTIPOLYGON (((29269.91 39...
252 NER MULTIPOLYGON (((29598.36 39...
16.10 Computing Distance Matrix
The are at least two ways to compute the required distance matrix. One is based on sf and the other is based on sp. Past experience shows that computing distance matrix by using sf function took relatively longer time that sp method. In view of this, sp method is used in the code chunks below.
16.10.1 Converting from sf data.table to SpatialPolygonDataFrame
Convert mpsz from simple feature collection to SpatialPolygonDataFrame.
class : SpatialPolygonsDataFrame
features : 332
extent : 2667.538, 56396.44, 15748.72, 50256.33 (xmin, xmax, ymin, ymax)
crs : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
variables : 6
names : SUBZONE_N, SUBZONE_C, PLN_AREA_N, PLN_AREA_C, REGION_N, REGION_C
min values : ADMIRALTY, AMSZ01, ANG MO KIO, AM, CENTRAL REGION, CR
max values : YUNNAN, YSSZ09, YISHUN, YS, WEST REGION, WR
Exploration: How to access a SpatialPolygonDataFrame object of the sp package.
mpsz_sp['SUBZONE_N'][[1]]
mpsz_sp@data # class dataframe
mpsz_sp@polygons # class: list
mpsz_sp@polygons[[1]] # access the first polygon / subzone
mpsz_sp@polygons[[1]]@Polygons # access the slot in the polygon object that contains information about individual Polygons within the overall geometry
mpsz_sp@polygons[[1]]@Polygons[[1]] # same as above, enter another layer
mpsz_sp@polygons[[1]]@Polygons[[1]]@coords #get the coordinates of the first polygon / subzone
mpsz_sp@polygons[[332]]@Polygons[[1]]@coords #total of 333 subzones16.10.2 Computing the distance matrix
spDists(x, y = x, longlat = FALSE, segments = FALSE, diagonal = FALSE)
spDists returns a full matrix of distances in the metric of the points if longlat=FALSE, or in kilometers if longlat=TRUE; it uses spDistsN1 in case points are two-dimensional. In case of spDists(x,x), it will compute all n x n distances, not the sufficient n x (n-1).
Arguments
x: A matrix of n-D points with row denoting points, first column x/longitude, second column y/latitude, or a Spatial object that has a coordinates method
y: A matrix of n-D points with row denoting points, first column x/longitude, second column y/latitude, or a Spatial object that has a coordinates method
longlat: logical; if FALSE (default), Euclidean distance, if TRUE Great Circle (WGS84 ellipsoid) distance; if x is a Spatial object, longlat should not be specified but will be derived from is.projected(x)
segments: logical; if TRUE, y must be missing; the vector of distances between consecutive points in x is returned.
diagonal: logical; if TRUE, y must be given and have the same number of points as x; the vector with distances between points with identical index is returned.
The diagonals of the ouput (332 by 332) are all 0. Distance with itself. The unit of distance is if ‘m’ (euclidean?) and km if WSG84?
[,1] [,2] [,3] [,4] [,5]
[1,] 0.0000 810.4491 1360.9294 840.4432 1076.792
[2,] 810.4491 0.0000 950.2937 1026.5876 1824.476
[3,] 1360.9294 950.2937 0.0000 808.0902 1964.059
[4,] 840.4432 1026.5876 808.0902 0.0000 1158.484
[5,] 1076.7916 1824.4762 1964.0592 1158.4844 0.000
We can use the code to check for the default arguments of a function quickly.
16.10.3 Get the sorted column and row names of out dist matrix
mpsz was previoulsy sorted by ‘SUBZONE_C’ in ascending order. The code below will extract only the column of ‘SUBZONE_C’.
16.10.4 Attaching SUBZONE_C to row and column for distance matrix matching ahead
We would like to set the column names and row names for our distance matrix .
colnames(dist): This is used to access or set the column names of the object. Note thatcolnames()is applicable to matrices and arrays.rownames(dist): This is used to access or set the row names of the object. Note thatrownames()is applicable to matrices and arrays.paste0(sz_names): This part creates a character vector by concatenating elements of thesz_namesvector without any separator. The resulting vector will be used as column names.
AMSZ01 AMSZ02 AMSZ03 AMSZ04 AMSZ05
AMSZ01 0.0000 810.4491 1360.9294 840.4432 1076.792
AMSZ02 810.4491 0.0000 950.2937 1026.5876 1824.476
AMSZ03 1360.9294 950.2937 0.0000 808.0902 1964.059
AMSZ04 840.4432 1026.5876 808.0902 0.0000 1158.484
AMSZ05 1076.7916 1824.4762 1964.0592 1158.4844 0.000
16.10.5 Pivoting distance value by SUBZONE_C
We will use the melt() function of the reshape2 package to convert wide-format data to long-format data. This function will convert wide-format data to a data frame with columns for each combination of row and column indices and their corresponding values.
To do the opposite, used cast().
Three new columns generated, (1) ‘var1’, (2) ‘var2’ and (3) ‘value’ containing the distance for the corresponding var1-var2 pair; thus rename to ‘dist’.
There are 110,224 rows in distPair due to 332P2 + 332 = 332*331 + 332. Number of possible permutations with replacement.
Var1 Var2 dist
1 AMSZ01 AMSZ01 0.0000
2 AMSZ02 AMSZ01 810.4491
3 AMSZ03 AMSZ01 1360.9294
4 AMSZ04 AMSZ01 840.4432
5 AMSZ05 AMSZ01 1076.7916
6 AMSZ06 AMSZ01 805.2979
7 AMSZ07 AMSZ01 1798.7526
8 AMSZ08 AMSZ01 2576.0199
9 AMSZ09 AMSZ01 1204.2846
10 AMSZ10 AMSZ01 1417.8035
16.10.6 Updating intra-zonal distances
The row contain subzone A to subzone A (distance = 0) can be removed by filtering.
Var1 Var2 dist
AMSZ01 : 331 AMSZ01 : 331 Min. : 173.8
AMSZ02 : 331 AMSZ02 : 331 1st Qu.: 7149.5
AMSZ03 : 331 AMSZ03 : 331 Median :11890.0
AMSZ04 : 331 AMSZ04 : 331 Mean :12229.4
AMSZ05 : 331 AMSZ05 : 331 3rd Qu.:16401.7
AMSZ06 : 331 AMSZ06 : 331 Max. :49894.4
(Other):107906 (Other):107906
The code chunk below adds a constant distance value of 50m into the intra-zones commute. The diagonals of dist matrix (if still a matrix) would have been 50m.
Var1 Var2 dist
1 AMSZ01 AMSZ01 50.0000
2 AMSZ02 AMSZ01 810.4491
3 AMSZ03 AMSZ01 1360.9294
4 AMSZ04 AMSZ01 840.4432
5 AMSZ05 AMSZ01 1076.7916
6 AMSZ06 AMSZ01 805.2979
7 AMSZ07 AMSZ01 1798.7526
8 AMSZ08 AMSZ01 2576.0199
9 AMSZ09 AMSZ01 1204.2846
10 AMSZ10 AMSZ01 1417.8035
Lastly, code chunk is used to save the data frame for future use.
16.11 Preparing flow data
The code chunk below is used to prepare the flow_data. od_data contains intra-zonal trips (unlike od_trip1 ). There are 310 unique origin subzone values and 311 unique destin subzone values.
16.11.1 Separating intra-flow from passenger volume df
Two new fields called ‘FlowNoIntra’ and ‘offset’ are created.
check code
Print flow_data
| ORIGIN_SZ | DESTIN_SZ | MORNING_PEAK | FlowNoIntra | offset |
|---|---|---|---|---|
| AMSZ01 | AMSZ01 | 2575 | 0 | 1e-06 |
| AMSZ01 | AMSZ02 | 11742 | 11742 | 1e+00 |
| AMSZ01 | AMSZ03 | 14886 | 14886 | 1e+00 |
Rows: 20,987
Columns: 5
Groups: ORIGIN_SZ [310]
$ ORIGIN_SZ <chr> "AMSZ01", "AMSZ01", "AMSZ01", "AMSZ01", "AMSZ01", "AMSZ01…
$ DESTIN_SZ <chr> "AMSZ01", "AMSZ02", "AMSZ03", "AMSZ04", "AMSZ05", "AMSZ06…
$ MORNING_PEAK <dbl> 2575, 11742, 14886, 3237, 9349, 2231, 1714, 2624, 2371, 1…
$ FlowNoIntra <dbl> 0, 11742, 14886, 3237, 9349, 2231, 1714, 2624, 2371, 183,…
$ offset <dbl> 1e-06, 1e+00, 1e+00, 1e+00, 1e+00, 1e+00, 1e+00, 1e+00, 1…
The ‘ORIGIN_SZ’ and ‘DESTIN_SZ’ fields are in character format. Let us convert to factor format
flow_data$ORIGIN_SZ <- as.factor(flow_data$ORIGIN_SZ)
flow_data$DESTIN_SZ <- as.factor(flow_data$DESTIN_SZ)
head(flow_data,5)# A tibble: 5 × 5
# Groups: ORIGIN_SZ [1]
ORIGIN_SZ DESTIN_SZ MORNING_PEAK FlowNoIntra offset
<fct> <fct> <dbl> <dbl> <dbl>
1 AMSZ01 AMSZ01 2575 0 0.000001
2 AMSZ01 AMSZ02 11742 11742 1
3 AMSZ01 AMSZ03 14886 14886 1
4 AMSZ01 AMSZ04 3237 3237 1
5 AMSZ01 AMSZ05 9349 9349 1
16.11.2 Combining passenger volume data with distance value
distPair is a df containing distances for all corresponding subzone pairs (including self, default to 50m). ‘var1’, ‘var2’, ‘dist’
flow_data is a df containing ‘origin_sz’, ‘destin_sb’ and ‘morning_peak’
We will now perform a left join with two sets join keys.
The output contains distance and total morning peak trips for each possible pairs of subzones (self included).
Before left join:
flow_data has 20,987 rows.
distPair has 110,224 rows (is the all possible pairs out of 332 subzones, order matters and with replacement.)
After join:
flow_data1 has 20,987 rows.
flow_data1 <- left_join(flow_data, distPair,
by = c('ORIGIN_SZ'= 'Var1',
'DESTIN_SZ'= 'Var2'))
glimpse(flow_data1)Rows: 20,987
Columns: 6
Groups: ORIGIN_SZ [310]
$ ORIGIN_SZ <fct> AMSZ01, AMSZ01, AMSZ01, AMSZ01, AMSZ01, AMSZ01, AMSZ01, A…
$ DESTIN_SZ <fct> AMSZ01, AMSZ02, AMSZ03, AMSZ04, AMSZ05, AMSZ06, AMSZ07, A…
$ MORNING_PEAK <dbl> 2575, 11742, 14886, 3237, 9349, 2231, 1714, 2624, 2371, 1…
$ FlowNoIntra <dbl> 0, 11742, 14886, 3237, 9349, 2231, 1714, 2624, 2371, 183,…
$ offset <dbl> 1e-06, 1e+00, 1e+00, 1e+00, 1e+00, 1e+00, 1e+00, 1e+00, 1…
$ dist <dbl> 50.0000, 810.4491, 1360.9294, 840.4432, 1076.7916, 805.29…
Print out
| ORIGIN_SZ | DESTIN_SZ | MORNING_PEAK | FlowNoIntra | offset | dist |
|---|---|---|---|---|---|
| AMSZ01 | AMSZ01 | 2575 | 0 | 1e-06 | 50.0000 |
| AMSZ01 | AMSZ02 | 11742 | 11742 | 1e+00 | 810.4491 |
| AMSZ01 | AMSZ03 | 14886 | 14886 | 1e+00 | 1360.9294 |
| AMSZ01 | AMSZ04 | 3237 | 3237 | 1e+00 | 840.4432 |
| AMSZ01 | AMSZ05 | 9349 | 9349 | 1e+00 | 1076.7916 |
| AMSZ01 | AMSZ06 | 2231 | 2231 | 1e+00 | 805.2979 |
16.12 Preparing Origin and Destination Attributes
16.12.1 Importing population data
The dataset used here is the Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020 in csv format .(i.e. respopagesextod2011to2020.csv). This is an aspatial data file. It can be downloaded at Department of Statistics, Singapore, the specific link can be found here. Although it does not contain any coordinates values, but it’s ‘PA’ and ‘SZ’ fields can be used as unique identifiers to geocode to ‘PLAN_AREA_N’ and ‘SUBZONE_N’ of the MP14_SUBZONE_WEB_PL shapefile respectively.
# A tibble: 6 × 5
PA SZ AGE7_12 AGE13_24 AGE25_64
<chr> <chr> <dbl> <dbl> <dbl>
1 ANG MO KIO ANG MO KIO TOWN CENTRE 310 710 2780
2 ANG MO KIO CHENG SAN 1140 2770 15700
3 ANG MO KIO CHONG BOON 1010 2650 14240
4 ANG MO KIO KEBUN BAHRU 1050 2390 12460
5 ANG MO KIO SEMBAWANG HILLS 420 1120 3620
6 ANG MO KIO SHANGRI-LA 810 1920 9650
16.12.2 Geospatial data wrangling
Let us append the population data for different age group range to the mpsz df.
pop has 984,656 rows
mpsz has 332 rows
After join: 984,656 rows
Column selected are ‘PA’, ‘SZ’, ‘AGE7-12’, ‘AGE13-24’, ‘AGE25_64’ from pop df and ‘SUBZONE_C’ from mpsz df.
16.7.3 Preparing origin attribute
Summaries
OD matrix is often incomplete. Imagine trying to complete the OD matrix, it would involve us doing spatial interaction or OD surveys to find the missing values. There are 332 subzones in Singapore, and each survey is expensive,. In addition, OD matrix is constantly changing as flow patterns changes. We are trying to predict flows between origins and destinations. Flow could be thought of a function of (1) attribute of origin (propulsiveness) (2) attribute of destination (attractiveness) and (3) cost friction (like distance or transport cost or public transport stops). Assumption is that the benefits must outweigh the cost in order for flow to happen.
Gravity model takes into consideration the interaction between all origin and destination locations.
Potential model takes in consideration the interaction between a location and all other location pairs. (Good for measuring accessibility)
Retail model is commonly used by franchise like KFC / Pizza Hut to determine their area/region of service (aka delivery zones) for each outlet.
There are 4 variations in the Gravity model:
- Unconstrained: only the overall outflow is fixed and total outflow from origins = total inflow to destinations
- Origin constrained: outflow by origin is fixed.
- Destination constrained: inflow by destination is fixed.
- Doubly constrained: outflow by origin and inflow by destination is fixed.
To calculate flow from each origin to each destination, we need parameters like k, alpha, lambda and beta. The beta for distance is usually negative because we assume that there is an inverse relationship between interaction and distance, like Newtonian physics and laws of gravity.